Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flakey TestPluginCleanup in CI #49104

Closed
wants to merge 1 commit into from

Conversation

kiosion
Copy link
Contributor

@kiosion kiosion commented Nov 16, 2024

Acquire lock for all Access List resources before locking individual Access List in ListAccessListMembers.

What seemed to happen in CI is if ListAccessListMembers was called, it would lock the individual list. If IneligibilityReconciler then ran simultaneously, it would lock all Access Lists, before failing to acquire the individual Access List's lock, leading to a deadlock.

This didn't seem to come up in real-world conditions, but was causing flakey failures in CI when running many times at once.

Resolves #5456.

Acquire lock for all Access List resources before locking individual
Access List in `ListAccessListMembers`.

What seemed to happen in CI was `ListAccessListMembers` was called,
thereby locking the individual list. If IneligibilityReconciler ran
simultaneously, it would lock all Access Lists, then fail to acquire
the individual Access List's lock, leading to a deadlock.

This didn't seem to come up in real-world conditions, but was causing
flakey failures in CI when running many times at once.
@kiosion kiosion added the no-changelog Indicates that a PR does not require a changelog entry label Nov 16, 2024
@kiosion kiosion requested a review from smallinsky November 16, 2024 00:11
err := a.service.RunWhileLocked(ctx, lockName(accessListName), accessListLockTTL, func(ctx context.Context, _ backend.Backend) error {
_, err := a.service.GetResource(ctx, accessListName)
if err != nil {
err := a.service.RunWhileLocked(ctx, []string{accessListResourceLockName}, accessListLockTTL, func(ctx context.Context, _ backend.Backend) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So only one auth server can ever list access list members at a time?

Any scale concerns here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly... Going over this in more detail today; I'm not really satisfied with this fix. It might work to not lock all lists on update/upsert of a single list instead.

@smallinsky
Copy link
Contributor

smallinsky commented Nov 18, 2024

If this deadlock can only happen during testing and not in real-world scenarios:

This didn’t seem to come up in real-world conditions.

why not align the tests accordingly and leave the implementation unchanged?

@kiosion
Copy link
Contributor Author

kiosion commented Nov 18, 2024

Found the actual cause – opened #5530.

@kiosion kiosion closed this Nov 18, 2024
@kiosion kiosion deleted the maxim/fix-flakey-test-cleanup branch November 18, 2024 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-changelog Indicates that a PR does not require a changelog entry size/sm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants